equation 4
Appendices
In Equation 4, maximization is carried out over the input y to the inverse map and over the input z, which is captured in p̂ in the above optimization problem; i.e., maximization over z in Equation 4 is equivalent to choosing p̂ restricted to singleton (Dirac-delta) distributions. Since Equation 4 describes a constrained optimization problem, our approach to solving it in practice is dual gradient descent. Gradient descent is used to optimize the Lagrangian of Equation 4 (with the constraint on p(z) rewritten as a constraint on log p(z), since log p(z) is easier to use numerically for stochastic optimization), shown in Equation 5. At each iteration, it samples a function from this distribution and queries the point x_t^* that greedily minimizes this function. Information Ratio: Russo and Van Roy [30] related the expected regret of TS to its expected information gain, i.e., the expected reduction in the entropy of the posterior distribution of X.
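The dual gradient descent idea in the excerpt above can be illustrated on a toy problem. The sketch below is not the excerpt's Equation 4 or Equation 5, which are not reproduced here. It assumes a hypothetical one-dimensional objective f(z) = -(z - 3)^2, a standard normal density p(z), and a hypothetical threshold LOG_EPS chosen so that the constraint log p(z) >= LOG_EPS is the same as |z| <= 2. All function names and step sizes are illustrative assumptions.

```python
import math

# Hypothetical threshold: log p(z) >= LOG_EPS is equivalent to |z| <= 2
# for a standard normal p(z).
LOG_EPS = -2.0 - 0.5 * math.log(2.0 * math.pi)


def objective_grad(z):
    """Gradient of the toy objective f(z) = -(z - 3)^2 (to be maximized)."""
    return -2.0 * (z - 3.0)


def log_p(z):
    """Log-density of a standard normal, standing in for log p(z)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def log_p_grad(z):
    """Gradient of log p(z) with respect to z."""
    return -z


def dual_gradient_descent(outer_steps=200, inner_steps=50, lr_z=0.1, lr_lam=0.5):
    """Optimize L(z, lam) = f(z) + lam * (log p(z) - LOG_EPS) with lam >= 0."""
    z, lam = 0.0, 0.0
    for _ in range(outer_steps):
        # Primal step: gradient ascent on the Lagrangian in z for fixed lam.
        for _ in range(inner_steps):
            z += lr_z * (objective_grad(z) + lam * log_p_grad(z))
        # Dual step: gradient descent on lam, projected onto lam >= 0.
        # A violated constraint (negative slack) increases lam.
        slack = log_p(z) - LOG_EPS
        lam = max(0.0, lam - lr_lam * slack)
    return z, lam


if __name__ == "__main__":
    z_opt, lam_opt = dual_gradient_descent()
    print(z_opt, lam_opt)
```

In this toy setting the unconstrained maximizer z = 3 is infeasible, so the multiplier grows until the iterate settles on the boundary of the feasible set. Working with log p(z) rather than p(z) keeps the constraint gradient well scaled, which matches the numerical motivation given in the excerpt.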
Sparse Multiple Kernel Learning: Alternating Best Response and Semidefinite Relaxations
Bertsimas, Dimitris, Iglesias, Caio de Prospero, Johnson, Nicholas A. G.
We study Sparse Multiple Kernel Learning (SMKL), which is the problem of selecting a sparse convex combination of prespecified kernels for support vector binary classification. Unlike prevailing l1 regularized approaches that approximate a sparsifying penalty, we formulate the problem by imposing an explicit cardinality constraint on the kernel weights and add an l2 penalty for robustness. We solve the resulting non-convex minimax problem via an alternating best response algorithm with two subproblems: the alpha subproblem is a standard kernel SVM dual solved via LIBSVM, while the beta subproblem admits an efficient solution via the Greedy Selector and Simplex Projector algorithm. We reformulate SMKL as a mixed integer semidefinite optimization problem and derive a hierarchy of semidefinite convex relaxations which can be used to certify near-optimality of the solutions returned by our best response algorithm and also to warm start it. On ten UCI benchmarks, our method with random initialization outperforms state-of-the-art MKL approaches in out-of-sample prediction accuracy on average by 3.34 percentage points (relative to the best performing benchmark) while selecting a small number of candidate kernels in comparable runtime. With warm starting, our method outperforms the best performing benchmark's out-of-sample prediction accuracy on average by 4.05 percentage points. Our convex relaxations provide a certificate that in several cases, the solution returned by our best response algorithm is the globally optimal solution.
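The abstract does not spell out the beta subproblem, so the following is only a minimal sketch of the projection step that the name "Greedy Selector and Simplex Projector" suggests. It assumes the usual form of that procedure: keep the k largest entries of a vector, project them onto the probability simplex, and set the rest to zero. The function names `project_simplex` and `gssp` and the cardinality parameter `k` are illustrative, not taken from the paper.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        # The last index j for which uj - t > 0 determines the shift tau.
        if uj - t > 0:
            tau = t
    return [max(x - tau, 0.0) for x in v]


def gssp(v, k):
    """Keep the k largest entries of v, project them onto the simplex,
    and zero out every other coordinate."""
    top = sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k]
    projected = project_simplex([v[i] for i in top])
    w = [0.0] * len(v)
    for i, p in zip(top, projected):
        w[i] = p
    return w


if __name__ == "__main__":
    # Four candidate kernel weights, at most two kernels kept.
    print(gssp([0.5, 0.1, 0.9, -0.2], k=2))
```

The result is a weight vector with at most k nonzero entries that sums to one, which is the kind of sparse convex combination of kernels the cardinality constraint calls for. How this step is combined with the l2 penalty and the kernel SVM dual inside the alternating best response loop is not covered by this sketch.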